ABSTRACT

The sparse vector technique is a powerful differentially private primitive that allows an analyst to check whether queries in a stream are greater or lesser than a threshold. This technique has a unique property: the algorithm works by adding noise with a finite variance to the queries and the threshold, and guarantees privacy that only degrades with (a) the maximum sensitivity of any one query in the stream, and (b) the number of positive answers output by the algorithm. Recent work has developed variants of this algorithm, which we call generalized private threshold testing, that are claimed to have privacy guarantees that do not depend on the number of positive or negative answers output by the algorithm. These algorithms result in a significant improvement in utility over the sparse vector technique for a given privacy budget, and have found applications in frequent itemset mining, feature selection in machine learning, and generating synthetic data.

In this paper we critically analyze the privacy properties of generalized private threshold testing. We show that generalized private threshold testing does not satisfy $\epsilon$-differential privacy for any finite $\epsilon$. We identify a subtle error in the privacy analysis of this technique in prior work. Moreover, we show that an adversary can use generalized private threshold testing to recover counts from the datasets (especially small counts) with high accuracy, which can result in individuals being reidentified. We demonstrate our attacks empirically on real datasets.

1. INTRODUCTION 

A popular building block for $\epsilon$-differentially private query answering is the Laplace mechanism. Given a set of queries $Q$ as input, the Laplace mechanism adds noise drawn independently from the Laplace distribution to each query in $Q$. Adding noise with standard deviation $\sqrt{2}/\epsilon$ to each of the queries in $Q$ ensures $(\Delta_Q \cdot \epsilon)$-differential privacy, where $\Delta_Q$ is the sensitivity of $Q$, i.e., the sum of the changes in each of the queries $q \in Q$ when one row is added to or removed from the input database. Increasing the number of queries increases the sensitivity, and thus for a fixed privacy budget the mechanism's accuracy is poor for large sets of queries (unless the queries operate on disjoint subsets of the domain).

The sparse vector technique (SVT) is an algorithm that allows testing whether each query in a stream is greater or lesser than a threshold $\theta$. SVT works by adding noise to both the threshold $\theta$ and to each of the queries $q \in Q$. If noise with standard deviation $\sqrt{2}/\epsilon$ is added to the threshold and each of the queries, SVT can be shown to satisfy $c\Delta\epsilon$-differential privacy, where $\Delta$ is the maximum sensitivity of any single query in $Q$ and $c$ is the number of positive answers (greater than the threshold) that the algorithm outputs. Note that the privacy guarantee does not depend on the number of queries with negative answers, and the sensitivity does not necessarily increase as the number of queries increases.

Recent work has explored the possibility of extending this technique to eliminate the dependence on the number of positive answers ($c$). We call this idea generalized private threshold testing. It works like SVT: noise is added to both the threshold and each of the queries, using noise whose standard deviation depends only on $\epsilon$ and the maximum sensitivity $\Delta$ of a single query in $Q$. Generalized private threshold testing has been claimed to ensure differential privacy with a privacy parameter that has no dependence on the number of positive or negative queries. Hence, generalized private threshold testing has been used to develop algorithms with high utility for private frequent itemset mining [5], feature selection in private classification [9] and generating synthetic data [2].

In this article, we critically analyze the privacy properties of generalized private threshold testing. We make the following contributions:

• We show that generalized private threshold testing does not satisfy $\epsilon$-differential privacy for any $\epsilon$ that does not depend on the number of queries being tested. We identify a specific claim in the privacy analysis in prior work that is assumed to hold, but does not hold in reality.

• We show specific examples of neighboring databases, queries and outputs that violate the requirement that the algorithm output is insensitive to adding or removing a row in the data.

• We present an attack algorithm and demonstrate that using generalized private threshold testing could make it possible for adversaries to reconstruct the counts for each cell with high probability, especially the cells with small counts.

Organization: Section 2 surveys concepts on differential privacy and the sparse vector technique. In Section 3 we introduce generalized private threshold testing and its instantiations in prior works. We show that generalized private threshold testing does not satisfy differential privacy in Section 4. We describe an attack algorithm for reconstructing the counts of cells in the input datasets by using generalized private threshold testing in Section 5 and demonstrate our attacks on real datasets.

2. PRELIMINARIES 

Databases: A database $D$ is a multiset of entries whose values come from a domain $T = \{u_1, u_2, \dots, u_k\}$. Let $n = |D|$ denote the number of entries in the database. We represent a database $D$ as a histogram of counts over the domain. That is, $D$ is represented as a vector $\mathbf{x}$ with one entry per domain value, where $\mathbf{x}[i]$ or $x_i$ denotes the true count of entries in $D$ with the $i$-th value of the domain $T$.

Differential Privacy: We define a neighborhood relation $N$ on databases as follows: two databases $D_1$ and $D_2$ are considered neighboring datasets if and only if they differ in the presence or absence of a single entry. That is, $(D_1, D_2) \in N$ iff for some $t \in T$, $D_1 = D_2 \cup \{t\}$ or $D_2 = D_1 \cup \{t\}$. Equivalently, if $\mathbf{x}_1$ and $\mathbf{x}_2$ are histograms of neighboring databases, $\|\mathbf{x}_1 - \mathbf{x}_2\|_1 = 1$. An algorithm satisfies differential privacy if its outputs are statistically similar on neighboring databases.

Definition 1 ($\epsilon$-differential privacy). A randomized algorithm $M$ satisfies $\epsilon$-differential privacy if for any pair of neighboring databases $(D_1, D_2) \in N$, and $\forall S \in range(M)$,

$$\Pr[M(D_1) = S] \le e^{\epsilon} \cdot \Pr[M(D_2) = S] \qquad (1)$$

The value of $\epsilon$, called the privacy budget, controls the level of privacy, and limits how well an adversary can distinguish a dataset from its neighboring datasets given an output. Smaller values of $\epsilon$ correspond to more privacy.


Differentially private algorithms satisfy the following composition properties. Let $M_1(\cdot)$ and $M_2(\cdot)$ be $\epsilon_1$- and $\epsilon_2$-differentially private algorithms, respectively.

• Sequential Composition: Releasing the outputs of $M_1(D)$ and $M_2(D)$ satisfies $(\epsilon_1 + \epsilon_2)$-differential privacy.

• Parallel Composition: Releasing $M_1(D_1)$ and $M_2(D_2)$, where $D_1 \cap D_2 = \emptyset$, satisfies $\max(\epsilon_1, \epsilon_2)$-differential privacy.

• Postprocessing: For any algorithm $M_3(\cdot)$, releasing $M_3(M_1(D))$ still satisfies $\epsilon_1$-differential privacy. That is, postprocessing an output of a differentially private algorithm does not incur any additional loss of privacy.

Thus, complex differentially private algorithms can be built by composing simpler private algorithms. The Laplace Mechanism [3] is one such widely used building block: it achieves differential privacy by adding noise from a Laplace distribution with a scale proportional to the global sensitivity.

Definition 2 (Global Sensitivity). The global sensitivity of a function $f : \mathcal{D} \to \mathbb{R}^n$, denoted as $\Delta(f)$, is defined to be the maximum $L_1$ distance between the outputs on any two neighboring datasets $D_1$ and $D_2$.

$$\Delta(f) = \max_{(D_1, D_2) \in N} \|f(D_1) - f(D_2)\|_1 \qquad (2)$$

Definition 3 (Laplace Mechanism). For any function $f : \mathcal{D} \to \mathbb{R}^n$, the Laplace Mechanism $M$ is given by $M(D) = f(D) + \eta$, where $\eta$ is a vector of independent random variables drawn from a Laplace distribution with the probability density function

$$p(x \mid \lambda) = \frac{1}{2\lambda} e^{-|x|/\lambda}, \quad \text{where } \lambda = \Delta(f)/\epsilon.$$

Theorem 1. The Laplace Mechanism satisfies $\epsilon$-differential privacy.
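As an illustration of Definition 3, the following is a minimal Python sketch of the Laplace mechanism. The helper names `laplace_noise` and `laplace_mechanism`, the example counts, and the parameter values are assumptions made for this sketch and do not come from the paper. The noise is sampled as the difference of two exponential variables, which is one standard way to obtain a Laplace variable.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def laplace_mechanism(true_answers, sensitivity, epsilon, rng):
    """Add independent Laplace(sensitivity / epsilon) noise to every answer."""
    scale = sensitivity / epsilon
    return [a + laplace_noise(scale, rng) for a in true_answers]

rng = random.Random(0)
# Hypothetical histogram with two cells; adding or removing one row changes
# exactly one count by 1, so the L1 sensitivity is 1.
noisy = laplace_mechanism([10.0, 3.0], sensitivity=1.0, epsilon=0.5, rng=rng)
print(noisy)
```

A smaller `epsilon` gives a larger noise scale, which is the privacy/accuracy trade-off described above.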

Sparse Vector Technique:

Algorithm 1 shows the details of the sparse vector technique (SVT). The input of SVT is a stream of queries $Q = \{q_1, q_2, \dots, q_k\}$, where each query $q \in Q$ has sensitivity bounded by $\Delta$, a threshold $\theta$, and a limit $c$. For every query, SVT outputs either $\bot$ (negative response) or $\top$ (positive response). SVT works in two steps: (1) Perturb the threshold $\theta$ by adding noise drawn from the Laplace distribution with scale $2\Delta/\epsilon$, getting $\tilde{\theta}$. (2) Perturb each query $q_i$ by adding Laplace noise $Lap(2c\Delta/\epsilon)$, getting $\tilde{q}_i$. Output $\bot$ if $\tilde{q}_i < \tilde{\theta}$ and $\top$ otherwise. The algorithm stops once it has output $c$ positive responses.

Theorem 2. SVT satisfies $\epsilon$-differential privacy. Moreover, if the number of queries is $k$ and their maximum sensitivity is $\Delta$, then with probability at least $1 - \delta$, for every $a_i = \top$, $q_i > \theta - \alpha$ and for every $a_i = \bot$, $q_i < \theta + \alpha$, where

$$\alpha = O\left(\frac{c\Delta}{\epsilon} \cdot (\log k + \log(2/\delta))\right)$$
Algorithm 1 Sparse Vector Technique
Input: Dataset $D$, a stream of queries $q_1, q_2, \dots$ with bounded sensitivity $\Delta$, threshold $\theta$, a cutoff point $c$ and privacy budget $\epsilon$
Output: a stream of answers
1: $\tilde{\theta} \leftarrow \theta + Lap(2\Delta/\epsilon)$, $count \leftarrow 0$
2: for each query $i$ do
3:   $\nu_i \leftarrow Lap(2\Delta \cdot c/\epsilon)$
4:   if $q_i(D) + \nu_i \ge \tilde{\theta}$ then
5:     Output $v_i = \top$
6:     $count \leftarrow count + 1$
7:   else
8:     Output $v_i = \bot$
9:   end if
10:  if $count \ge c$ then
11:    Abort
12:  end if
13: end for
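The steps of Algorithm 1 can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: the function name `sparse_vector`, the representation of answers as booleans (`True` for a positive answer, `False` for a negative one), and the example inputs are all assumptions. Query answers are passed in precomputed, as if each $q_i(D)$ had already been evaluated.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def sparse_vector(query_answers, theta, c, sensitivity, epsilon, rng):
    """Sketch of Algorithm 1; returns True (positive) / False (negative)."""
    noisy_theta = theta + laplace_noise(2.0 * sensitivity / epsilon, rng)
    answers, count = [], 0
    for q in query_answers:
        nu = laplace_noise(2.0 * sensitivity * c / epsilon, rng)
        if q + nu >= noisy_theta:
            answers.append(True)
            count += 1
        else:
            answers.append(False)
        if count >= c:
            break  # abort once c positive answers have been released
    return answers

# Hypothetical stream: five answers far below and five far above the threshold.
out = sparse_vector([0] * 5 + [1000] * 5, theta=500, c=2,
                    sensitivity=1.0, epsilon=1.0, rng=random.Random(0))
print(out)
```

The early exit is what ties the privacy cost to the number of positive answers; it is exactly the limit that the variants studied in this paper remove.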


3. GENERALIZED PRIVATE THRESHOLD TESTING

In this section, we describe a method called Generalized Private Threshold Testing (GPTT) (see Algorithm 2) that generalizes variations of the sparse vector technique that do not require a limit on the number of positive (or negative) responses.

GPTT takes as input a dataset $D$, a set of queries $Q = \{q_1, \dots, q_n\}$ with bounded sensitivity $\Delta$, a threshold $\theta$ and privacy parameters $\epsilon_1$ and $\epsilon_2$. For every query, GPTT outputs either $\bot$ or $\top$, approximating whether or not the query answer is smaller than the threshold. GPTT works exactly like SVT: the threshold is perturbed using noise drawn from $Lap(\Delta/\epsilon_1)$, the queries are perturbed using noise from $Lap(\Delta/\epsilon_2)$, and the output is computed by comparing the noisy query answer with the noisy threshold. The only difference is that there is no limit on the number of positive or negative queries.
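The description above can be sketched in Python as follows. The function name `gptt`, the boolean encoding of answers (`True` for $\top$, `False` for $\bot$), and the convention that `eps2 = math.inf` means "add no noise to the queries" are assumptions made for this sketch.

```python
import math
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def gptt(query_answers, theta, sensitivity, eps1, eps2, rng):
    """Sketch of GPTT: one noisy threshold, no cap on the number of answers."""
    noisy_theta = theta + laplace_noise(sensitivity / eps1, rng)
    out = []
    for q in query_answers:
        if math.isinf(eps2):
            noisy_q = q  # eps2 = infinity: the true answer is compared
        else:
            noisy_q = q + laplace_noise(sensitivity / eps2, rng)
        out.append(noisy_q >= noisy_theta)  # True encodes "top"
    return out

# Hypothetical query answers, compared against a threshold of 2.5.
print(gptt([0, 1, 2, 3, 4, 5], theta=2.5, sensitivity=1.0,
           eps1=1.0, eps2=math.inf, rng=random.Random(3)))
```

With `eps2 = math.inf` every query is compared against the same single realization of the noisy threshold, so the output is monotone in the true answers. This is the source of the leakage analyzed in Section 4.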

GPTT is a generalization of variations presented in prior work. Lee and Clifton [6] used GPTT for private frequent itemset mining, splitting the budget $\epsilon$ between $\epsilon_1$ and $\epsilon_2$. Chen et al. [2] instantiate GPTT with $\epsilon_1 = \epsilon_2$ for generating synthetic data. Stoddard et al. [5] observed that the privacy guarantee does not depend on $\epsilon_2$ and propose the Private Threshold Testing algorithm, which is identical to GPTT with $\epsilon_1 = \epsilon$ and $\epsilon_2 = \infty$. Private threshold testing was used for private feature selection for classification.

3.1 Privacy Analysis of GPTT 

We now extend the privacy analysis from the prior work cited above to generalized private threshold testing. We will show in the next section that this privacy analysis is flawed and that GPTT does not satisfy differential privacy.

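Algorithm 2 Generalized Private Threshold Testing
Input: Dataset $D$, a set of queries $Q = \{q_1, \dots, q_n\}$ with bounded sensitivity $\Delta$, threshold $\theta$, privacy parameters $\epsilon_1$, $\epsilon_2$
Output: A vector of answers $v = [v_1, v_2, \dots, v_n] \in \{\top, \bot\}^n$
1: $\tilde{\theta} \leftarrow \theta + Lap(\Delta/\epsilon_1)$
2: for $q_i \in Q$ do
3:   $\tilde{q}_i \leftarrow q_i(D) + Lap(\Delta/\epsilon_2)$
4:   if $\tilde{q}_i < \tilde{\theta}$ then
5:     $v_i \leftarrow \bot$
6:   else
7:     $v_i \leftarrow \top$
8:   end if
9: end for
10: return $v$

Given any set of queries $Q = \{q_1, \dots, q_n\}$, let the vector $v = \langle v_1, \dots, v_n \rangle \in \{\top, \bot\}^n$ denote the output of GPTT. Given any two neighboring databases $D_1$ and $D_2$, let $\mathcal{V}_1$ and $\mathcal{V}_2$ denote the output distribution on $v$ when $D_1$ and $D_2$ are the input databases, respectively. We use $v^{<i}$ to denote the $i-1$ previous answers (i.e., $v^{<i} = \langle v_1, \dots, v_{i-1} \rangle$). Then we have

$$\frac{\mathcal{V}_1(v)}{\mathcal{V}_2(v)} = \frac{\prod_{i=1}^{n} \mathcal{V}_1(v_i = a_i \mid v^{<i})}{\prod_{i=1}^{n} \mathcal{V}_2(v_i = a_i \mid v^{<i})} = \prod_{i: a_i = \top} \frac{\mathcal{V}_1(v_i = \top \mid v^{<i})}{\mathcal{V}_2(v_i = \top \mid v^{<i})} \prod_{i: a_i = \bot} \frac{\mathcal{V}_1(v_i = \bot \mid v^{<i})}{\mathcal{V}_2(v_i = \bot \mid v^{<i})}$$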

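Let $H_i(x)$ be the probability that $q_i$ is positive (i.e., $v_i = \top$) in $D$ when the noisy threshold is $x$. That is,

$$H_i(x) = P[v_i = \top \mid \tilde{\theta} = x, v^{<i}]$$

Then, given a specific noisy threshold $\tilde{\theta} = x$, the probability that $v_i = \top$ is independent of the answers to previous queries. That is,

$$H_i(x) = P[v_i = \top \mid x, v^{<i}] = P[v_i = \top \mid x] \qquad (3)$$

Thus, if $f(y; \mu, \lambda) = \frac{1}{2\lambda} \exp\left(-\frac{|y - \mu|}{\lambda}\right)$, then

$$H_i(x) = \int_{x}^{\infty} f\left(y; q_i, \frac{\Delta}{\epsilon_2}\right) dy = \int_{x+\Delta}^{\infty} f\left(y; q_i + \Delta, \frac{\Delta}{\epsilon_2}\right) dy$$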

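Prior work uses the above property of $H_i(x)$ to show that GPTT satisfies $2\epsilon_1$-differential privacy.

Let $S = \{i \mid a_i = \top \text{ and } q_i(D_1) = q_i(D_2)\}$ and $\bar{S} = \{i \mid a_i = \top \text{ and } q_i(D_1) \ne q_i(D_2)\}$. Then

$$\prod_{i: a_i = \top} \mathcal{V}_1(v_i = \top \mid v^{<i}) = \prod_{i \in S} \mathcal{V}_1(v_i = \top \mid v^{<i}) \prod_{i \in \bar{S}} \mathcal{V}_1(v_i = \top \mid v^{<i})$$

Let $H_i^1(x)$ and $H_i^2(x)$ denote the probability that $v_i = \top$ in $D_1$ and $D_2$, resp., when the noisy threshold is $x$. Then we have,

$$\prod_{i \in S} \mathcal{V}_1(v_i = \top \mid v^{<i}) = \int_{-\infty}^{\infty} P[\tilde{\theta} = x] \prod_{i \in S} H_i^1(x)\, dx$$
$$= \int_{-\infty}^{\infty} P[\tilde{\theta} = x] \prod_{i \in S} H_i^2(x)\, dx \quad \text{since } q_i(D_1) = q_i(D_2)$$
$$= \prod_{i \in S} \mathcal{V}_2(v_i = \top \mid v^{<i})$$

Since $\tilde{\theta} = \theta + Lap(\Delta/\epsilon_1)$, we have

$$\prod_{i \in \bar{S}} \mathcal{V}_1(v_i = \top \mid v^{<i}) = \int_{-\infty}^{\infty} P[\tilde{\theta} = x] \prod_{i \in \bar{S}} H_i^1(x)\, dx$$
$$\le e^{\epsilon_1} \int_{-\infty}^{\infty} P[\tilde{\theta} = x - \Delta] \prod_{i \in \bar{S}} H_i^2(x - \Delta)\, dx$$
$$= e^{\epsilon_1} \prod_{i \in \bar{S}} \mathcal{V}_2(v_i = \top \mid v^{<i})$$

Thus, it is seen that

$$\prod_{i: a_i = \top} \mathcal{V}_1(v_i = \top \mid v^{<i}) \le \exp(\epsilon_1) \prod_{i: a_i = \top} \mathcal{V}_2(v_i = \top \mid v^{<i})$$

Similarly,

$$\prod_{i: a_i = \bot} \mathcal{V}_1(v_i = \bot \mid v^{<i}) \le \exp(\epsilon_1) \prod_{i: a_i = \bot} \mathcal{V}_2(v_i = \bot \mid v^{<i})$$

Therefore, $\frac{\mathcal{V}_1(v)}{\mathcal{V}_2(v)} \le \exp(2\epsilon_1)$.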

Remarks: Note that adding noise to the queries is not really required. The above proof goes through even if $\epsilon_2 = \infty$. In fact, as we will see next, we can achieve the same utility no matter what the value of $\epsilon_2$ is.

3.2 Utility of GPTT

We first consider the utility of the case when $\epsilon_2 = \infty$ (i.e., no noise added to queries), and then show that we can achieve (almost) the same utility even when $\epsilon_2$ is finite.

Theorem 3. For GPTT with parameters $\epsilon_1$ and $\epsilon_2 = \infty$, with probability at least $1 - \delta$, $v_i = \bot$ implies $q_i < \theta + \alpha$ and $v_i = \top$ implies $q_i > \theta - \alpha$, where

$$\alpha = \frac{\Delta}{\epsilon_1} \log\left(\frac{1}{\delta}\right)$$

and $\Delta$ is the maximum sensitivity of the input queries.

Proof. All we need to show is that the noise added to the threshold is at most $\pm\alpha$ with probability $1 - \delta$.

$$P(|\tilde{\theta} - \theta| \le \alpha) \ge 1 - \delta$$
$$\Rightarrow P\left(-\alpha \le Lap\left(\frac{\Delta}{\epsilon_1}\right) \le \alpha\right) \ge 1 - \delta$$
$$\Rightarrow P\left(Lap\left(\frac{\Delta}{\epsilon_1}\right) > \alpha\right) \le \frac{\delta}{2}$$
$$\Rightarrow \alpha \ge \frac{\Delta}{\epsilon_1} \log\left(\frac{1}{\delta}\right)$$
□

Now we extend the utility result to GPTT when $\epsilon_2 < \infty$.

Theorem 4. Let $D$ be a database and $Q$ a query set with maximum sensitivity $\Delta$. For every $\beta, \delta > 0$, we can use GPTT with parameters $\epsilon_1$, $\epsilon_2 < \infty$ to determine whether $q_i < \theta + \alpha$ or $q_i > \theta - \alpha$ for any $q_i \in Q$, with probability $(1 - \delta)(1 - \beta)$, where

$$\alpha = \frac{\Delta}{\epsilon_1} \log\left(\frac{1}{\delta}\right)$$

Proof. Since the privacy of GPTT is claimed not to depend on the number of queries in $Q$ (as long as sensitivity is bounded by $\Delta$), we can consider a new query set $Q'$ that has $t$ copies $\{q_{i1}, q_{i2}, \dots, q_{it}\}$ of each query $q_i \in Q$. Then for each query $q_i \in Q$, we have $t$ independent comparisons of the noisy query answer $\tilde{q}_{ij}$ and the noisy threshold $\tilde{\theta}$. We use the majority of these $t$ results to determine whether $q_i$ is smaller or greater than $\tilde{\theta}$.

We can show that with probability at least $1 - \beta$, we can correctly identify whether $q_i(D)$ is greater or lesser than the noisy threshold $\tilde{\theta}$.

Without loss of generality, suppose $q_i < \tilde{\theta}$; then $p = P(\tilde{q}_i < \tilde{\theta}) > \frac{1}{2}$. Let $\{X_j, j \ge 1\}$ be a sequence of i.i.d. binary random variables with expectation $E[X_j] = p$, where $X_j$ is 1 if $\tilde{q}_{ij} < \tilde{\theta}$ and 0 otherwise. By the law of large numbers, for any positive number $\gamma$ we have

$$\lim_{t \to \infty} \Pr\left(\left|\frac{1}{t}(X_1 + \cdots + X_t) - p\right| > \gamma\right) = 0$$
$$\Rightarrow \lim_{t \to \infty} \Pr\left(X_1 + \cdots + X_t > \frac{t}{2}\right) = 1$$

Thus, there exists a number $t$ such that, for every $q_i \in Q$, the majority vote determines whether $q_i$ is smaller or greater than $\tilde{\theta}$ with probability high enough that all the $q_i$ are correctly judged with probability at least $1 - \beta$.

We get the desired result by combining this with the proof of Theorem 3. □

We can see that the information leaked by GPTT with $\epsilon_2 < \infty$ tends to the information leaked by GPTT with $\epsilon_2 = \infty$ as the number of copies $t$ goes to infinity. Thus, we can just focus on the case when $\epsilon_2 = \infty$.
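The majority-vote argument in the proof of Theorem 4 can be illustrated with a short Python sketch. It fixes one realization of the noisy threshold and compares $t$ independently perturbed copies of a single query against it. The function name `majority_vote`, the fixed value `noisy_theta = 0.0`, and the choice $t = 2001$ are assumptions made for this illustration.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def majority_vote(q, noisy_theta, sensitivity, eps2, t, rng):
    """Compare t noisy copies of q against one fixed noisy threshold and
    return the majority decision (True encodes 'top')."""
    scale = sensitivity / eps2
    positives = sum(q + laplace_noise(scale, rng) >= noisy_theta
                    for _ in range(t))
    return 2 * positives > t

rng = random.Random(7)
noisy_theta = 0.0  # hypothetical realization of the noisy threshold
for q in (-1.0, 1.0):
    decision = majority_vote(q, noisy_theta, sensitivity=1.0,
                             eps2=1.0, t=2001, rng=rng)
    print(q, decision)
```

Each single comparison is noisy, but the majority over many copies recovers the comparison of the true answer against the noisy threshold, which is what GPTT with $\epsilon_2 = \infty$ would have released directly.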
4. GPTT IS NOT PRIVATE

While prior work claims that GPTT is differentially private (as discussed in Section 3), we show that this algorithm does not satisfy the privacy condition. The proof is constructive and exhibits pairs of neighboring datasets and queries for which GPTT violates differential privacy.

Theorem 5. GPTT does not satisfy $\epsilon$-differential privacy for any finite $\epsilon$.

In the proof of this theorem, we start with the case of $\epsilon_2 = \infty$, where the true query answer is compared with the noisy threshold. It is easy to show that GPTT does not satisfy differential privacy in this case, since deterministic information about the queries is leaked. In particular, if $v_i = \bot$ and $v_j = \top$, we are certain that on the input database $D$, there is some $x$ such that $q_i(D) < x \le q_j(D)$.

The proof of the more general case follows from Theorem 4, which shows that anything that is disclosed by GPTT with $\epsilon_2 = \infty$ is also disclosed with high probability when $\epsilon_2$ is finite (by making a sufficient number of copies of the input queries). We present the formal proof below.

Proof. Consider two queries $q_1$ and $q_2$ with sensitivity 1. For the special case of GPTT where $\epsilon_2 = \infty$, suppose in dataset $D$, $q_1(D) = 0$ and $q_2(D) = 1$. Also suppose in a neighboring dataset $D'$, $q_1(D') = 1$ and $q_2(D') = 0$. Let the threshold $\theta$ be 0. Then, the probability of getting an output $v_1 = \bot$ and $v_2 = \top$ is $> 0$ under database $D$: this corresponds to the probability that the noisy threshold is within $(0, 1)$. However, under the neighboring dataset $D'$, $P[v_1 = \bot, v_2 = \top] = 0$. This is because $q_1(D') > q_2(D')$. Thus for any noisy threshold $\tilde{\theta}$,

$$v_1 = \bot \Rightarrow q_1(D') < \tilde{\theta} \Rightarrow q_2(D') < \tilde{\theta} \Rightarrow v_2 = \bot$$

Hence, GPTT with $\epsilon_2 = \infty$ does not satisfy differential privacy.
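The two probabilities in this counterexample have closed forms, which the following Python sketch evaluates. The helper names `laplace_cdf` and `prob_bot_top` and the choice $\epsilon_1 = 1$ are assumptions made for this illustration.

```python
import math

def laplace_cdf(x, scale):
    """CDF of a zero-mean Laplace distribution with the given scale."""
    if x < 0:
        return 0.5 * math.exp(x / scale)
    return 1.0 - 0.5 * math.exp(-x / scale)

def prob_bot_top(q1, q2, theta, eps1):
    """P[v1 = bot, v2 = top] for GPTT with eps2 = infinity, sensitivity 1.

    v1 = bot needs q1 < noisy_theta and v2 = top needs q2 >= noisy_theta,
    so the noisy threshold must fall in the interval (q1, q2].
    """
    if q2 <= q1:
        return 0.0  # the interval is empty: the output is impossible
    scale = 1.0 / eps1
    return laplace_cdf(q2 - theta, scale) - laplace_cdf(q1 - theta, scale)

p_D = prob_bot_top(q1=0, q2=1, theta=0, eps1=1.0)        # dataset D
p_D_prime = prob_bot_top(q1=1, q2=0, theta=0, eps1=1.0)  # neighbor D'
print(p_D, p_D_prime)
```

Since the output has positive probability under $D$ and probability zero under $D'$, no finite $\epsilon$ can satisfy inequality (1) for this pair.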

To prove that GPTT does not satisfy differential privacy when $\epsilon_2 < \infty$, we construct a counterexample similar to the one above, except that we use $t$ copies of $q_1$ and $q_2$. So again, let the query set $Q = \{q_1, \dots, q_{2t}\}$ be such that on dataset $D$, $q_1(D) = \dots = q_t(D) = 0$ and $q_{t+1}(D) = \dots = q_{2t}(D) = 1$. We assume all queries have sensitivity 1. On the neighboring dataset $D'$, $q_1(D') = \dots = q_t(D') = 1$ and $q_{t+1}(D') = \dots = q_{2t}(D') = 0$. Let the threshold $\theta = 0$. Let the output vector $v$ be such that $v_1 = \dots = v_t = \bot$ and $v_{t+1} = \dots = v_{2t} = \top$. Then we have:

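$$\mathcal{V}(v) = P[GPTT(D) = v] = \int_{-\infty}^{\infty} P(\tilde{\theta} = z) \prod_{i=1}^{t} P(\tilde{q}_i < z) \prod_{i=t+1}^{2t} P(\tilde{q}_i \ge z)\, dz$$
$$= \int_{-\infty}^{\infty} f_{\epsilon_1}(z) \left(F_{\epsilon_2}(z)\,(1 - F_{\epsilon_2}(z-1))\right)^t dz$$
$$= \int_{-\infty}^{\infty} f_{\epsilon_1}(z) \left(F_{\epsilon_2}(z) - F_{\epsilon_2}(z) F_{\epsilon_2}(z-1)\right)^t dz$$

where $f_\epsilon(z)$ and $F_\epsilon(z)$ are the pdf and cdf respectively of the Laplace distribution with parameter $1/\epsilon$. Similarly, we have on the neighboring database $D'$,

$$\mathcal{V}'(v) = P[GPTT(D') = v] = \int_{-\infty}^{\infty} f_{\epsilon_1}(z) \left(F_{\epsilon_2}(z-1) - F_{\epsilon_2}(z) F_{\epsilon_2}(z-1)\right)^t dz$$

Let $\mathcal{V}'(v) = \alpha$ and let $\delta = |F_{\epsilon_1}^{-1}(\alpha/4)|$. Since $\alpha < 1$, $\delta$ is greater than the 75th percentile of a Laplace distribution with scale $1/\epsilon_1$. That is,

$$\frac{\alpha}{2} = \int_{-\infty}^{-\delta} f_{\epsilon_1}(t)\, dt + \int_{\delta}^{\infty} f_{\epsilon_1}(t)\, dt$$

Moreover, note that since $F_{\epsilon_2}(z-1) < F_{\epsilon_2}(z)$ for all $z$, we have

$$1 < \kappa(z) = \frac{F_{\epsilon_2}(z) - F_{\epsilon_2}(z) F_{\epsilon_2}(z-1)}{F_{\epsilon_2}(z-1) - F_{\epsilon_2}(z) F_{\epsilon_2}(z-1)}$$

Let $\kappa$ denote the minimum value $\kappa(z)$ takes over all $z \in [-\delta, \delta]$; thus $\kappa > 1$.

Now we get

$$\mathcal{V}'(v) = \alpha = \int_{-\infty}^{\infty} f_{\epsilon_1}(z) \left(F_{\epsilon_2}(z-1) - F_{\epsilon_2}(z) F_{\epsilon_2}(z-1)\right)^t dz$$
$$\le \int_{-\delta}^{\delta} f_{\epsilon_1}(z) \left(F_{\epsilon_2}(z-1) - F_{\epsilon_2}(z) F_{\epsilon_2}(z-1)\right)^t dz + \int_{z \notin [-\delta, \delta]} f_{\epsilon_1}(z)\, dz$$
$$= \int_{-\delta}^{\delta} f_{\epsilon_1}(z) \left(F_{\epsilon_2}(z-1) - F_{\epsilon_2}(z) F_{\epsilon_2}(z-1)\right)^t dz + \frac{\alpha}{2}$$

Thus, we have

$$\int_{-\delta}^{\delta} f_{\epsilon_1}(z) \left(F_{\epsilon_2}(z-1) - F_{\epsilon_2}(z) F_{\epsilon_2}(z-1)\right)^t dz \ge \frac{1}{2} \int_{-\infty}^{\infty} f_{\epsilon_1}(z) \left(F_{\epsilon_2}(z-1) - F_{\epsilon_2}(z) F_{\epsilon_2}(z-1)\right)^t dz$$

Therefore,

$$\mathcal{V}(v) = \int_{-\infty}^{\infty} f_{\epsilon_1}(z) \left(F_{\epsilon_2}(z) - F_{\epsilon_2}(z) F_{\epsilon_2}(z-1)\right)^t dz$$
$$\ge \int_{-\delta}^{\delta} f_{\epsilon_1}(z) \left(F_{\epsilon_2}(z) - F_{\epsilon_2}(z) F_{\epsilon_2}(z-1)\right)^t dz$$
$$\ge \kappa^t \int_{-\delta}^{\delta} f_{\epsilon_1}(z) \left(F_{\epsilon_2}(z-1) - F_{\epsilon_2}(z) F_{\epsilon_2}(z-1)\right)^t dz$$
$$\ge \frac{\kappa^t}{2} \int_{-\infty}^{\infty} f_{\epsilon_1}(z) \left(F_{\epsilon_2}(z-1) - F_{\epsilon_2}(z) F_{\epsilon_2}(z-1)\right)^t dz = \frac{\kappa^t}{2}\, \mathcal{V}'(v)$$

Since $\kappa > 1$, for every finite $\epsilon$ there exists a $t$ such that $\mathcal{V}(v) > e^{\epsilon}\, \mathcal{V}'(v)$, which violates differential privacy. □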
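The growth of the ratio $\mathcal{V}(v)/\mathcal{V}'(v)$ with $t$ can be checked numerically. The Python sketch below evaluates both integrals with a plain trapezoidal rule. The function names, the integration range $[-30, 30]$, the step count, and the choice $\epsilon_1 = \epsilon_2 = 1$ are assumptions made for this illustration; the code is a numerical sanity check and not part of the proof.

```python
import math

def lap_pdf(z, eps):
    """Density of a zero-mean Laplace distribution with scale 1/eps."""
    return 0.5 * eps * math.exp(-eps * abs(z))

def lap_cdf(z, eps):
    """CDF of a zero-mean Laplace distribution with scale 1/eps."""
    if z < 0:
        return 0.5 * math.exp(eps * z)
    return 1.0 - 0.5 * math.exp(-eps * z)

def output_probs(t, eps1, eps2, lo=-30.0, hi=30.0, steps=40000):
    """Trapezoidal estimates of V(v) on D and V'(v) on D'."""
    h = (hi - lo) / steps
    v = v_prime = 0.0
    for k in range(steps + 1):
        z = lo + k * h
        w = 0.5 if k in (0, steps) else 1.0
        f = lap_pdf(z, eps1)
        a = lap_cdf(z, eps2)       # F_{eps2}(z)
        b = lap_cdf(z - 1, eps2)   # F_{eps2}(z - 1)
        v += w * f * (a * (1.0 - b)) ** t        # integrand on D
        v_prime += w * f * (b * (1.0 - a)) ** t  # integrand on D'
    return v * h, v_prime * h

for t in (1, 2, 5, 10):
    v, v_prime = output_probs(t, eps1=1.0, eps2=1.0)
    print(t, v / v_prime)
```

Under these assumed parameters, the printed ratio increases with $t$ and passes $e^{2\epsilon_1}$, the bound claimed by the analysis in Section 3.1, after only a few copies.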
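4.1 Intuition

We believe that there is a subtle error in the privacy analysis in prior work (discussed in Section 3). Prior work splits the probability $\mathcal{V}(v)$ as:

$$\mathcal{V}(v) = \prod_{i: v_i = \bot} P[v_i = \bot \mid v^{<i}] \prod_{i: v_i = \top} P[v_i = \top \mid v^{<i}]$$
$$= \int f(x) \prod_{i: v_i = \bot} P[v_i = \bot \mid x, v^{<i}]\, dx \times \int f(x) \prod_{i: v_i = \top} P[v_i = \top \mid x, v^{<i}]\, dx$$
$$= \int f(x) \prod_{i: v_i = \bot} P[v_i = \bot \mid x]\, dx \times \int f(x) \prod_{i: v_i = \top} P[v_i = \top \mid x]\, dx$$

where $f(x) = P[\tilde{\theta} = x]$. This decomposition into $\bot$ and $\top$ answers is wrong. The main problem is that it uses the unconditioned $f(x)$ for all queries, whereas the distribution of the noisy threshold changes once we condition on the previous outputs. To take a simple example, let $q_1 = m > 0$, $q_2 = 0$, $\theta = 0$ and assume $v_1 = \bot$, $v_2 = \top$. For ease of explanation assume $\epsilon_2 = \infty$ (but the argument would work for finite $\epsilon_2$ as well). Now we can compute the probability of GPTT outputting $v_1, v_2$ as

$$P(v_1 = \bot, v_2 = \top) = P(v_2 = \top)\, P(v_1 = \bot \mid v_2 = \top) = P(v_2 = \top)\, P(m < \tilde{\theta} \mid 0 \ge \tilde{\theta}) = 0$$

However, if we use the expression above, we have

$$\mathcal{V}(v) = \prod_{i: v_i = \bot} P[v_i = \bot \mid v^{<i}] \prod_{i: v_i = \top} P[v_i = \top \mid v^{<i}]$$
$$= \int f(x) P(v_1 = \bot \mid x)\, dx \times \int f(x) P(v_2 = \top \mid x)\, dx$$
$$= \int f(x) P(m < x)\, dx \times \int f(x) P(0 \ge x)\, dx$$
$$= (1 - F(m)) \times F(0) > 0$$

where $F(x)$ is the distribution function of the noisy threshold.

Actually, the right decomposition should be

$$\mathcal{V}(v) = \prod_{i: v_i = \bot} P[v_i = \bot \mid v^{<i}] \prod_{i: v_i = \top} P[v_i = \top \mid v^{<i}]$$
$$= \prod_{i: v_i = \bot} \int f(x \mid v^{<i}) P[v_i = \bot \mid x, v^{<i}]\, dx \times \prod_{i: v_i = \top} \int f(x \mid v^{<i}) P[v_i = \top \mid x, v^{<i}]\, dx$$
$$= \prod_{i: v_i = \bot} \int f(x \mid v^{<i}) P[v_i = \bot \mid x]\, dx \times \prod_{i: v_i = \top} \int f(x \mid v^{<i}) P[v_i = \top \mid x]\, dx$$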
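The gap between the flawed decomposition and the true joint probability in the two-query example can be made concrete. The Python sketch below evaluates both quantities in closed form for $q_1 = m$, $q_2 = 0$, $\theta = 0$ and $\epsilon_2 = \infty$. The function names and the values $m = 1$, $\epsilon_1 = 1$ are assumptions made for this illustration.

```python
import math

def laplace_cdf(x, scale):
    """CDF of a zero-mean Laplace distribution with the given scale."""
    if x < 0:
        return 0.5 * math.exp(x / scale)
    return 1.0 - 0.5 * math.exp(-x / scale)

def true_joint(m, eps1):
    """P[v1 = bot, v2 = top]: needs m < noisy_theta <= 0, impossible for m > 0."""
    if m >= 0:
        return 0.0
    scale = 1.0 / eps1
    return laplace_cdf(0.0, scale) - laplace_cdf(m, scale)

def flawed_product(m, eps1):
    """Product of the two unconditioned marginals, as in the flawed analysis."""
    scale = 1.0 / eps1
    p_v1_bot = 1.0 - laplace_cdf(m, scale)  # P[noisy_theta > m]
    p_v2_top = laplace_cdf(0.0, scale)      # P[noisy_theta <= 0]
    return p_v1_bot * p_v2_top

m, eps1 = 1.0, 1.0
print(true_joint(m, eps1), flawed_product(m, eps1))
```

The product of marginals is strictly positive even though the joint event is impossible, which is the effect of ignoring how the earlier answers constrain the noisy threshold.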


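Algorithm 3 Attack Algorithm
Input: Dataset $D$ with domain $T$, privacy param $\epsilon$
Output: A partitioning of the domain $\mathcal{P} = \{P_1, \dots, P_p\}$
1: Let $Q \leftarrow \{diff(x_u, x_v) \mid \forall u, v \in T\}$
2: Let $\theta \leftarrow \lceil \frac{1}{\epsilon} \log(\frac{1}{\delta}) \rceil$
3: Run GPTT($\epsilon_1 = \epsilon$, $\epsilon_2 = \infty$) on database $D$, queries $Q$ and threshold $\theta$.
4: $\forall v \in T$, let $larger(v)$ be the set $\{u \in T \mid$ GPTT outputs $\top$ for $diff(x_u, x_v)\}$
5: Construct an ordered partition of the domain $\mathcal{P} = \{P_1, \dots, P_p\}$, such that
     $\forall u, v \in P_i$, $larger(u) = larger(v)$, and
     $\forall u \in P_i, v \in P_{i+1}$, $larger(v) \subset larger(u)$
6: return $\mathcal{P}$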


5. RECONSTRUCTING THE DATA USING GPTT

In the last section, we showed that generalized private threshold testing does not satisfy $\epsilon$-differential privacy for any finite $\epsilon$. While this is an interesting result, it still leaves open whether GPTT indeed leaks a significant amount of information from the dataset, and allows attacks like re-identification of individuals based on quasi-identifiers. In this section, we answer this question in the affirmative, and show that generalized private threshold testing may disclose the exact counts of domain values. Exact disclosure of cells with small counts (especially cells with counts 0, 1 and 2) reveals the presence of unique individuals in the data who can be susceptible to reidentification attacks.

We will show our attack for the special case of GPTT where $\epsilon_2 = \infty$. Since GPTT with finite $\epsilon_2$ can be made to leak, with high probability, as much information about a set of queries as GPTT with $\epsilon_2 = \infty$ (Theorem 4), we will not separately consider that case.

In our attack, we use a set of difference queries that compute the difference between the counts of pairs of domain elements.

Definition 4 (Difference Query). Let $u_1, u_2 \in T$ be a pair of domain elements, and let $x_1$ and $x_2$ be their counts in a dataset $D$. The difference query $diff(u_1, u_2)$ is given by

$$diff(u_1, u_2) = x_1 - x_2$$

Note that each $diff(u, v)$ query has sensitivity $\Delta = 1$.

Our attack algorithm is defined in Algorithm 3. Given the input dataset $D$ with domain $T$, we apply GPTT ($\epsilon_2 = \infty$) to the set of all difference queries using all pairs of domain elements $u, v \in T$. We use a threshold $\theta = \lceil \frac{1}{\epsilon} \log(\frac{1}{\delta}) \rceil$. Pairs of domain elements $u, v \in T$ are grouped together if for all $w \in T$, GPTT outputs the same value for both $diff(u, w)$ and $diff(v, w)$. This results in a partitioning of the domain. Further, for every domain element $u$, we define $larger(u)$ to be the set of $v \in T$ such that GPTT outputs $\top$ for $diff(v, u)$. These are the domain elements that satisfy $x_v - x_u > \tilde{\theta}$, where $\tilde{\theta}$ is the noisy threshold. We order the partitions such that elements $u \in P_i$ have a bigger $larger(u)$ set than elements $v \in P_j$, for $j > i$.
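The attack just described can be sketched in Python as follows, assuming $\epsilon_2 = \infty$ so that every difference query is compared exactly against a single noisy threshold. The function name `attack`, the representation of domain elements as list indices, the grouping by `frozenset`, and the example counts are assumptions made for this sketch. The loop is quadratic in the domain size and is meant for illustration only.

```python
import math
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def attack(counts, epsilon, delta, rng):
    """Return groups of cell indices, ordered from the largest 'larger' set
    to the smallest, i.e. from smaller counts to larger counts."""
    theta = math.ceil(math.log(1.0 / delta) / epsilon)
    noisy_theta = theta + laplace_noise(1.0 / epsilon, rng)
    n = len(counts)
    groups = {}
    for u in range(n):
        # larger(u): cells v for which GPTT answers "top" on diff(v, u)
        larger_u = frozenset(v for v in range(n)
                             if counts[v] - counts[u] >= noisy_theta)
        groups.setdefault(larger_u, []).append(u)
    # The 'larger' sets are nested, so sorting by size orders them by inclusion.
    return [groups[key] for key in sorted(groups, key=len, reverse=True)]

counts = [0, 0, 1, 2, 2, 3, 5, 8, 8, 13]  # hypothetical histogram
partition = attack(counts, epsilon=1.0, delta=0.05, rng=random.Random(0))
print(partition)
```

Cells with equal counts always land in the same group, and counts strictly increase from one group to the next, so the released answers reveal the relative order of the cells.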

We can show that the ordered partitioning $\mathcal{P}$ imposes an ordering on the counts in the database $D$.

Lemma 1. Let $D$ be a database on domain $T$. Let $\mathcal{P} = \{P_1, \dots, P_p\}$ be the ordered partitioning of $T$ output by Algorithm 3. Then with probability at least $1 - \delta$, for all $1 \le \ell < m \le p$, $u_i \in P_\ell$, $u_j \in P_m$, we have $x_i < x_j$.

Proof. Let $1 \le \ell < m \le p$, and let $\tilde{\theta}$ be the noisy threshold. Since $\theta = \lceil \frac{1}{\epsilon} \log(\frac{1}{\delta}) \rceil$, with probability at least $1 - \delta$, $\tilde{\theta} > 0$. For any $u_i \in P_\ell$ and $u_j \in P_m$, $larger(u_j) \subset larger(u_i)$. Therefore, there exists $u_k \in T$ such that $x_k - x_i > \tilde{\theta}$, but $x_k - x_j \le \tilde{\theta}$. Therefore, $x_i < x_j$. □

Let $S_i \subseteq T$ denote the set of domain elements that have count equal to $i$ in dataset $D$. It is easy to see that for every $S_i$ there is some $P \in \mathcal{P}$ output by Algorithm 3 such that $S_i \subseteq P$. We next show that for certain datasets there is an $m > 0$ such that the sets corresponding to small counts $0 \le i \le m$ are exactly reproduced in the partitioning output by Algorithm 3.

Theorem 6. Let $\mathcal{P} = \{P_0, P_1, \dots, P_p\}$ be the ordered partition output by Algorithm 3 on $D$ with parameter $\epsilon$. Let $D$ be a dataset such that $S_i \ne \emptyset$ for all $i \in [0, k]$. That is, $D$ contains at least one domain element with count equal to each of $0, 1, 2, \dots, k$. Let $\alpha = \lceil \frac{1}{\epsilon} \log \frac{1}{\delta} \rceil$. If $k > 2\alpha$ then with probability at least $1 - \delta$, for all $i \in [0, m]$, $P_i = S_i$, where $m = k - 2\alpha$.

Proof. Since $\theta = \alpha = \lceil \frac{1}{\epsilon} \log \frac{1}{\delta} \rceil$, with probability at least $1 - \delta$, the noisy threshold $\tilde{\theta}$ will be within $[0, 2\alpha]$. For the dataset $D$ such that $S_i \ne \emptyset$ for all $i \in [0, k]$, where $k > 2\alpha$, for any $i \in [0, m-1]$, suppose $u \in S_i$ and $v \in S_{i+1}$; then we have $larger(u) = \{z \mid x_z > \tilde{\theta} + i\}$ and $larger(v) = \{z \mid x_z > \tilde{\theta} + i + 1\}$.

Since $i \in [0, m]$, with probability at least $1 - \delta$, $\tilde{\theta} + i \le m + 2\alpha \le k$. Thus $larger(u) \setminus larger(v) \ne \emptyset$.

So we have $larger(v) \subset larger(u)$, and $S_i$, $S_{i+1}$ will not appear in the same $P_j \in \mathcal{P}$.

Furthermore, since $S_0, \dots, S_m$ belong to separate $P_i$ and $\mathcal{P} = \{P_0, P_1, \dots, P_p\}$ is the ordered partition, we have $P_i = S_i$ for $i \in [0, m]$. □

Theorem 6 shows that for datasets that have at least one domain element with count equal to $i$ for all $i \in [0, k]$, we can tell the exact counts of those domain elements with count in $[0, k - 2\alpha]$ with high probability. A number of datasets satisfy the assumption that counts in $[0, k]$ all have support. For instance, for datasets that are drawn from a Zipfian distribution, the size of $S_i$ is in expectation inversely proportional to the count $i$, and thus all small counts will have support for datasets of sufficiently large size.

We also find that a number of real world datasets satisfy the assumption that counts in $[0, k]$ all have support.


Datasets       Domain   Scale       k
Adult          4096     17665       15
MedicalCost    4096     9415        26
Income         4096     20787122    98
HEPTH          4096     347414      275

Table 1: Overview of some real world datasets


Algorithm 4 Reconstruct Algorithm
Input: Dataset $D$ with domain $T$, privacy param $\epsilon$
Output: A reconstruction of $D$
1: Split the budget $\epsilon = \epsilon_1 + \epsilon_2$
2: $\mathcal{P} \leftarrow$ Attack Algorithm($D$, $\epsilon_1$)
3: for $P \in \mathcal{P}$ do
4:   $c_P \leftarrow count(P) + Lap(\frac{1}{\epsilon_2})$
5:   $a_P \leftarrow c_P / |P|$
6:   We guess the count of cells in $P$ is $round(a_P)$
7: end for


Table 1 shows the features of some real world datasets. Adult is a histogram constructed from U.S. Census data [3] on the "capital loss" attribute. MedicalCost is a histogram of personal medical expenses from a survey. Income is the histogram on the "personal income" attribute from [8]. HEPTH is a histogram constructed using the citation network among high energy physics pre-prints on arXiv. The attributes Domain and Scale in Table 1 correspond to the size of $T$ and the number of tuples in the datasets. The feature $k$ for each dataset means that $S_i \ne \emptyset$ for all $i \in [0, k]$.

However, the above attack assumes some prior knowledge about the dataset. Specifically, we assume that the attacker knows $k$ such that all counts in $[0, k]$ have support in the input dataset. Next we present an extension of our attack that allows reconstructing counts in the dataset without any prior knowledge about the dataset, but with differentially private access to the dataset.

Empirical Data Reconstruction

Algorithm 4 outlines an attack for reconstructing a dataset using GPTT and differentially private access to the dataset. Algorithm 4 also takes as input a privacy budget $\epsilon$. We split the budget as $\epsilon = \epsilon_1 + \epsilon_2$. We use $\epsilon_1$ to run our GPTT-based attack algorithm (Algorithm 3), which outputs an ordered partition $\mathcal{P}$ of the domain. We use the remaining budget $\epsilon_2$ to compute noisy total counts $c_P$ for each partition $P \in \mathcal{P}$, and estimate the average count in each partition as $a_P = c_P / |P|$. We round this average to the nearest integer and guess that each domain element $u_i \in P$ has count $round(a_P)$.
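The counting and rounding step of this reconstruction can be sketched in Python as follows. The partition is taken as an input here (it would come from the attack algorithm), and the function name `reconstruct`, the noise scale $1/\epsilon_2$ for each group total, and the example data are assumptions made for this sketch.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def reconstruct(counts, partition, eps2, rng):
    """Guess one rounded average count per group of the partition."""
    guesses = [0] * len(counts)
    for group in partition:
        # Noisy total count of the group (assumed noise scale: 1 / eps2).
        noisy_total = sum(counts[i] for i in group) + laplace_noise(1.0 / eps2, rng)
        guess = round(noisy_total / len(group))
        for i in group:
            guesses[i] = guess
    return guesses

# Hypothetical data: three groups of 50 cells with counts 0, 1 and 2.
counts = [0] * 50 + [1] * 50 + [2] * 50
partition = [list(range(0, 50)), list(range(50, 100)), list(range(100, 150))]
guesses = reconstruct(counts, partition, eps2=0.5, rng=random.Random(0))
print(sum(g == c for g, c in zip(guesses, counts)) / len(counts))
```

Because the noise on a group total is divided by the group size before rounding, large groups of equal-count cells are recovered exactly with high probability, which is why cells with small counts are the most exposed.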

We run Algorithm 4 on the four datasets. Table 2 shows the fraction of domain elements in each dataset whose counts are correctly guessed by our algorithm for $\epsilon \in \{1, 0.5, 0.1\}$. Each accuracy measure is the average of 10 repetitions. In each experiment, we set $\epsilon_1 = \epsilon_2 = 0.5\epsilon$. When $\epsilon$ is not small, the counts of most cells can be reconstructed (for three of the datasets, over 90% of the cells are reconstructed at $\epsilon = 1.0$). As $\epsilon$ decreases, the fraction of correctly reconstructed cells becomes smaller. Two reasons explain this result: (1) a small $\epsilon_1$ leads to a coarser partition from the Attack Algorithm; (2) a small $\epsilon_2$ introduces more noise into the counts of each group, giving us wrong counts.

               ε = 1.0   ε = 0.5   ε = 0.1
Adult          0.994     0.991     0.981
MedicalCost    0.985     0.977     0.949
Income         0.798     0.741     0.636
HEPTH          0.904     0.795     0.477

Table 2: Empirical dataset reconstruction accuracy for different ε

               # of cells   ε = 1.0   ε = 0.5   ε = 0.1
Adult          4062         0.999     0.997     0.992
MedicalCost    3878         1.0       1.0       0.960
Income         2369         1.0       1.0       0.979
HEPTH          1153         1.0       1.0       0.970

Table 3: Empirical reconstruction accuracy on cells with small counts within [0, 5]

Table 3 displays the accuracy of reconstruction on domain elements with small counts within $[0, 5]$. Note that more than one quarter of the domain has counts in $[0, 5]$ for all the datasets. More than 95% of all domain elements with small counts within $[0, 5]$ can be reconstructed for all four datasets under all settings of $\epsilon$ considered. In particular, when $\epsilon$ is not small (e.g., $\epsilon = 1.0$), nearly all these cells can be accurately reconstructed by using Algorithm 4.

Discussion: These results show that not only does GPTT not satisfy differential privacy, it can also lead to a significant loss of privacy. Since cells with small counts can be reconstructed with very high accuracy (> 95%), access to the data via GPTT can result in releasing query answers that allow re-identification attacks. Hence, we believe that systems whose privacy stems from GPTT are not safe to use.


6. CONCLUSION 

We studied the privacy properties of a variant of the sparse vector technique called generalized private threshold testing (GPTT). This technique is claimed to satisfy differential privacy, has impressive utility properties, and has found applications in developing privacy preserving algorithms for frequent itemset mining, synthetic data generation and feature selection in machine learning. We show that the technique does not satisfy differential privacy. Moreover, we present attack algorithms that allow us to reconstruct counts from the input dataset (especially small counts) with high accuracy and with no prior knowledge about the dataset. Thus, we demonstrate that GPTT is not a safe technique to use on datasets with privacy concerns.